Abstract

We present results from numerical studies of supervised learning opera- 
tions in recurrent networks considered as graphs, leading from a given set 
of input conditions to predetermined outputs. Graphs that have optimized 
their output for particular inputs with respect to predetermined outputs are 
asymptotically stable and can be characterized by attractors which form a 
representation space for an associative multiplicative structure of input oper- 
ations. As the mapping from a series of inputs onto a series of such attractors 
generally depends on the sequence of inputs, this structure is generally non- 
commutative. Moreover, the size of the set of attractors, indicating the com- 
plexity of learning, is found to behave non-monotonically as learning proceeds. 
A tentative relation between this complexity and the notion of pragmatic in- 
formation is indicated. 



1 Introduction 



Graph theory has recently received increasing attention for applications to complex
systems in various disciplines (Gernert 1997, Paton 2002a,b, Bornholdt and Schuster
2003). The characterization of systems (with interrelated constituents) by graphs
(with linked vertices) is as general as their characterization in terms of
categories (with elements related by morphisms). Despite its generality, graph theory
has turned out to be a powerful tool for gaining very specific insight into structural 
and dynamical properties of complex systems (see Jost and Joy 2002, Atmanspacher 
et al. 2005 for examples). 

An area of particularly intense interest, in which complex systems abound, is 
biological information processing. This ranges from evolutionary biology via
genetics to the study of neural systems. Theoretical and computational neuroscience
have become rapidly growing fields (Hertz et al. 1991, Haykin 1999, Dayan and Ab- 
bott 2001) in which graph theoretical methods have gained considerable significance 
(cf. Sejnowski 2001). 

Two basic classes of biological networks are feedforward and recurrent networks. 
In networks with purely feedforward (directed) connectivities, neuronal input is 
mapped onto neuronal output through a feedforward synaptic weight matrix. In 
recurrent networks, there are additional (directed or bi-directed) connectivities be- 
tween outputs and other network elements, giving rise to a recurrent synaptic weight 
matrix. Much recurrent modeling incorporates the theory of nonlinear and complex 
dynamical systems (cf. Smolensky 1988, see also beim Graben 2004 for discussion). 

Hopfield networks are an example of a fully recurrent network in which all con- 
nectivities are bidirectional and the output is a deterministic function of the input. 
Their stochastic generalizations are known as Boltzmann machines. Another impor- 
tant distinction with respect to the implementation of neural networks refers to the 
way in which the neuronal states are characterized: the two main options are firing 
rates and action potentials (for more details see Haykin 1999). 

A key topic of information processing in complex biological networks is learning, 
for which three basically different scenarios are distinguished in the literature (see 
Dayan and Abbott 2001, Chap. III): unsupervised, supervised and reinforcement
learning. In unsupervised (also self-supervised) learning a network responds to in- 
puts solely on the basis of its intrinsic structure and dynamics. A network learns by 
evolving into a state that is constrained by its own properties and the given inputs, 
an important modelling strategy for implicit learning processes. 

In contrast, supervised learning presupposes the definition of desired input-output
relations, so the learned state of the network is additionally constrained by its out- 
puts. Usually, the learning process in this case develops by minimizing the difference 
between the actual output and the desired output. The corresponding optimization 
procedure is not intrinsic to the evolution of the system itself, but has to be ex- 
ternally arranged, hence the learning is called supervised. If the supervision is in 
some sense "naturalized" by coupling a network to an environment, which provides 
evaluative feedback, one speaks of reinforcement learning.

In this contribution we are interested in supervised learning (see Duda et al. 2000 
for a review) on small, fully recurrent networks implemented on graphs (cf. Jordan 
1998). We start with a general formal characterization in terms of dynamical systems
(Sec. 2.1), describe how such systems are implemented on graphs (Sec. 2.2), and show
how a graph reaches asymptotically stable states (attractors) when the learning process
is terminated, i.e. when the graph is optimized for given inputs and (random) initial
conditions with respect to predetermined outputs (Sec. 2.3).

We shall characterize the learning operations by a multiplicative structure
describing successively presented inputs in Sec. 3.1. In this context we confirm
and specify earlier conjectures (e.g., Gernert 1997) about the non-commutativity of 
learning operations for a concrete model. In Sec. 3.2, we study how the size of the 
set of attractors representing the derived structure changes during the process for 
perfectly and imperfectly optimized networks. The number of attractors is proposed 
to indicate the complexity of learning, and in Sec. 4 this is tentatively related to 
pragmatic information as a particular measure of meaning. 



2 Supervised Learning in Recurrent Networks 
2.1 General Notation 

Let $M$ be a set, and let $M = X \cup B$, with $X \cap B = \emptyset$, be a partition of $M$ into
two disjoint subsets. If $M$ is some closed subset of $\mathbb{R}^n$, $B$ may be the boundary of
$M$. (Later we will specify $M$ as the vertices of a graph, $B$ as a set of "external" or
"boundary" vertices, and $X$ as a set of "internal" vertices.)

We consider the dynamics of fields $u(x,y,t) \in U$, where $x \in X$, $y \in B$, $t$
represents time as parametrized discretely or continuously, and $U$ is the space of
admissible state values for the fields. The dynamics of $u$ can be described by an
equation

$$F[u(x,y,t)] = 0. \tag{1}$$

For a continuous time variable and $M \subset \mathbb{R}^n$, a typical example is the diffusion
equation

$$F[u(x,t)] = \frac{\partial u(x,t)}{\partial t} - \lambda \Delta u(x,t) \tag{2}$$

where $\Delta$ is the Laplace operator and $\lambda$ the diffusion constant. The only constraint
on Eq. (1) is that a state $u(x,y,0)$ at time $t = 0$ determines uniquely the solution for
any time $t > 0$.

We now define a set of external conditions $\{b_i : B \to U\}$ specifying field values
$b_i$ on $B$ which will be kept fixed during the time evolution of the fields on $M$. This
is to say that the dynamics of fields is effectively restricted to $X$:

$$F[u(x,b_i,t)] = 0. \tag{3}$$
Since the state of the system at time $t = 0$ uniquely determines the states for
all $t > 0$, we can define a mapping $\Phi_t$, the so-called time evolution operator, acting
on the set of field states. For an initial state $u$ at $t = 0$, $\Phi_t[u]$ yields the state of
the system at $t > 0$. Taking into account that different external conditions initiate
different evolutions, we have to specify the time evolution operator as a mapping
$\Phi_{t,b_i} : F_X \to F_X$, where $F_X$ is the set of states $u(x) : X \to U$, by the following
construction: Let $u(x, t = 0)$ be the initial condition for Eq. (3), then

$$\Phi_{t,b_i}[u(x, t = 0)] = u(x, b_i, t)$$

is the state of the corresponding solution at time $t$ under the external condition $b_i$.

In principle, the state space of $\Phi_{t,b_i}$ can be the entire set of states $F_X$. However,
for reasons which will become clear below, we are interested in dissipative systems
evolving into attractors in the limit of large $t$. If one of the states belonging to an
attractor is chosen as an initial condition, the image of $\Phi_{t,b_i}$ will again be one of the
attractor states. This allows us to reduce the number of possible states on which
the mappings close.

Denoting the flow operator $B_i = \Phi_{t,b_i}$ as the input under the external condition
$b_i$, we now consider the set of states $A \subset F_X$ belonging to attractors after time $t$.
Then all mappings $B_i$, applied to an attractor state $a$, lead to images in $A$:

$$B_i[a] \in A \quad \text{for } a \in A. \tag{4}$$

In general, the set of all attractor states $A$ does not contain a proper subset which
is mapped onto itself by the set of mappings $\{B_i\}$; otherwise $A$ can be reduced to
such a subset. Each single mapping $B_i$ may not be surjective, but the union of the
images of all $\{B_i\}$ equals $A$.

Due to condition (4), we can define a composition of mappings $B_i$. In this way,
the external conditions $\{b_i\}$ give rise to an associative multiplicative structure $\{B_i\}$.
This structure is represented on the set of attractors $\{a_j\}$.

2.2 Implementation on Graphs

We now implement the general notions developed so far on graphs (see Wilson 1985
for an introduction to graph theory) and specify the set $M$ as the set of vertices $V$
of a graph. For simplicity we consider directed graphs with single connections for
each direction between any two vertices and without self-loops. Such a graph gives
rise to non-reflexive relations on $V$ and can be represented by an adjacency matrix
$Ad$. For two vertices $x_1$ and $x_2$ we have:

$$Ad(x_1, x_2) = \begin{cases} 1 & \text{if there exists a directed line from } x_2 \text{ to } x_1 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

If Ad is symmetric, the graph is undirected. 
The set of vertices $V$ is decomposed into a set of external vertices $V_{\mathrm{ext}}$ and a set
of internal vertices $V_{\mathrm{int}}$. If $N$ is the total number of vertices, $N_{\mathrm{ext}}$ the number of
external vertices and $N_{\mathrm{int}}$ the number of internal vertices, we have $N = N_{\mathrm{ext}} + N_{\mathrm{int}}$.

Next we consider fields $u(z,t)$ on a graph with vertices $z \in V$ evolving in discrete
time steps $t \in \mathbb{N}$ according to:

$$u(z, t+1) = f\Bigl(\sum_{y \to z} u(y,t)\Bigr) = f\Bigl(\sum_{y} Ad(z,y)\, u(y,t)\Bigr) \tag{6}$$



The value of the field u at vertex z and time t + 1 depends only on the sum of the 
field values at neighboring vertices y at time t. 

The fields $u(z,t)$ assume integer values $\{0, 1, \dots, I_{\max}\}$, and the function $f$ is
defined as:

$$f(x) = \begin{cases} \mathrm{int}\bigl(I_{\max} \cdot (x/n_0)\bigr) & \text{for } x < n_0 \\ \mathrm{int}\bigl(I_{\max} \cdot (n_1 - x)/(n_1 - n_0)\bigr) & \text{for } n_0 \le x \le n_1 \\ 0 & \text{for } x > n_1 \end{cases} \tag{7}$$

where $\mathrm{int}(x)$ denotes $x$ rounded to the nearest integer. The function $f(x)$ is shown in
Fig. 1.

Figure 1: The function $f(x)$ according to Eq. (7). The values $I_{\max} = 10$, $n_0 = 10$, and
$n_1 = 30$ are used in the simulations.

The restriction of u(z,t) to integer values implies that there is only a finite 
number of states. Starting from an arbitrary initial state in Fx, the system runs 
into an attractor after a few time steps. In many cases, this attractor is a fixed 
point, i.e. one single state that is asymptotically stable. Sometimes the attractor 
is a limit cycle, i.e. a periodic succession of several (usually few) states. Strange 
attractors do not occur since the number of states is finite. 
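
The update rule of Eqs. (6) and (7) and the search for an attractor can be illustrated
with a minimal Python sketch. It is not the code used for the simulations: the names
`f`, `step` and `find_attractor`, the 3-vertex toy graph, and the rounding convention
(round half up) are assumptions made for illustration; only the parameter values
$I_{\max} = 10$, $n_0 = 10$, $n_1 = 30$ are taken from the text.

```python
I_MAX, N0, N1 = 10, 10, 30  # parameter values quoted for Fig. 1


def f(x):
    """Response function of Eq. (7): rises to I_MAX at N0, falls back to 0 at N1."""
    if x < N0:
        return int(I_MAX * x / N0 + 0.5)              # round half up (assumed convention)
    if x <= N1:
        return int(I_MAX * (N1 - x) / (N1 - N0) + 0.5)
    return 0


def step(state, ad, fixed):
    """One synchronous update, Eq. (6); vertices listed in `fixed` keep their values."""
    n = len(state)
    return tuple(
        state[z] if z in fixed else f(sum(ad[z][y] * state[y] for y in range(n)))
        for z in range(n)
    )


def find_attractor(state, ad, fixed):
    """Iterate until a state repeats; return the states on the resulting cycle."""
    seen, trajectory = {}, []
    state = tuple(state)
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, ad, fixed)
    return trajectory[seen[state]:]                   # length 1 means a fixed point


# Hypothetical toy graph: vertex 0 is external, vertices 1 and 2 are internal.
# toy_ad[z][y] == 1 means a directed line from y to z, as in Eq. (5).
toy_ad = [[0, 0, 0],
          [1, 0, 1],
          [0, 1, 0]]
attractor = find_attractor((I_MAX, 0, 0), toy_ad, fixed={0})
```

Because the state space is finite, the loop in `find_attractor` always terminates; a
returned list with more than one state corresponds to a limit cycle.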

The external conditions are defined as fixed states on the external vertices, 
i.e., the state values on the external vertices remain unaffected by the dynamics. 
Of course, the external conditions are supposed to influence the dynamics of the 
internal vertices. 

The graphs used in our investigations consist of a total of $N = 24$ vertices with
$N_{\mathrm{ext}} = 16$ external vertices and $N_{\mathrm{int}} = 8$ internal vertices. The maximal value of
$u(z,t)$ is defined to be $I_{\max} = 10$. We consider 11 different input patterns $b_i$ which
are shown in Fig. 2.

Figure 2: The 11 input patterns $b_i$ on the 16 external vertices, represented as three basic
types of $4 \times 4$ matrices: o indicates field value 0, • indicates field value $I_{\max}$.

In order to obtain a minimal set $A$ of attractor states under which the multiplication
of the evolution operators $B_i$ is closed, we first determine the attractor state
$a_1$ (or states, with $i = 1, \dots, m$, for a limit cycle of period $m$) corresponding to input
$B_1$, starting from a random initial distribution of states in $F_X$. Subsequently, $B_2$ is
applied to $a_1$, and so on until $B_{11}$ provides the final attractor state(s).

Next we apply all evolution operations $B_1, \dots, B_{11}$ to the obtained set of attractors
until no new attracting states are generated. The resulting set $A$ can be
represented in terms of a mapping diagram. An example for such a mapping diagram
with a relatively small number of 10 attractor states, which are all fixed-point
attractors, is shown in Tab. 1. The corresponding field values on the eight internal
vertices are listed in Tab. 2, and the corresponding adjacency matrix of the graph
is given in Tab. 3.

Table 1: Example of a mapping diagram for 11 inputs $B_i$ and a system with 10 different
attractors $a_j$. The entries show the number of the attractor state which is obtained by
applying $B_i$ (plotted vertically) to $a_j$ (plotted horizontally).

Table 2: Configuration of field values on internal vertices for the 10 attractors $a_j$ of Tab. 1.
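
The closure procedure behind a mapping diagram can be sketched in Python as follows.
This is a hedged illustration rather than the original implementation: it assumes that
each evolution operation $B_i$ is available as a function from one attractor state to
another, that only fixed-point attractors occur, and that `start` is already an
attractor state. The names `close_attractor_set`, `mapping_diagram` and the three toy
operators are hypothetical.

```python
def close_attractor_set(operators, start):
    """Apply all operators repeatedly until no new attractor states are generated."""
    attractors = [start]
    queue = [start]
    while queue:
        a = queue.pop(0)
        for op in operators:
            image = op(a)
            if image not in attractors:
                attractors.append(image)
                queue.append(image)
    return attractors


def mapping_diagram(operators, attractors):
    """Row i, column j: 1-based number of the attractor reached by B_i from a_j."""
    return [[attractors.index(op(a)) + 1 for a in attractors] for op in operators]


# Toy stand-ins for three inputs acting on integer-labelled states (hypothetical).
toy_ops = [
    lambda a: 0,                    # projects every state onto state 0
    lambda a: 1,                    # projects every state onto state 1
    lambda a: 2 if a == 1 else 0,   # outcome depends on the previous attractor
]
A = close_attractor_set(toy_ops, start=0)
table = mapping_diagram(toy_ops, A)
```

A row with identical entries corresponds to an input that is recognized independently
of the previous attractor; a row with differing entries, such as the last one here,
signals dependence on the preceding input.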



Rows $x = 17$ and $x = 18$ of $Ad(x, y)$ (the rows $x = 19, \dots, 24$ are illegible):
000000000000000000010010
000000000000000000001101

Table 3: The adjacency matrix $Ad$ for the mapping diagram in Tab. 1 with $17 \le x \le 24$
plotted vertically and $1 \le y \le 24$ plotted horizontally. Since there are no directed
lines from internal vertices to external vertices and no lines between external vertices,
$Ad(x, y) = 0$ for $x \le 16$, and only rows $x > 16$ are shown. As explained in Sec. 2.3, there are no
direct connections from the 16 external vertices to the first two internal vertices serving
as outputs.

Table 4: Optimal output states for all inputs.

inputs    output states
B_i       1         2
1         0         0
2         0         I_max
3         0         I_max
4         0         I_max
5         0         I_max
6         I_max     0
7         I_max     0
8         I_max     0
9         I_max     0
10        I_max     0
11        I_max     0

From the mapping diagram one can deduce the multiplicative structure of the
operations $B_i$. A simple indicator for the complexity of this structure is the minimal
number of attractor states necessary for the structure to close. For the structure
corresponding to the mapping diagram in Tab. 1 one can see that the first eight inputs
give rise to very simple relations:

$$B_i B_j = B_i \quad \text{for } i \le 8 \text{ and } j \text{ arbitrary},$$

representing projection operators. Furthermore, some of these elements are identical:

$$B_4 = B_2, \quad B_5 = B_3, \quad B_6 = B_7.$$

The remaining three elements generate new elements of the multiplicative structure.
Simple products of these three elements yield four relations,

$$B_9^2 = B_9, \quad B_{10}^2 = B_{10}, \quad B_{11}^2 = B_{11}, \quad B_{10} B_{11} = B_6,$$

leaving us with five new elements. The total multiplicative structure contains more
than 20 elements.
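
How a multiplicative structure is generated from the rows of a mapping diagram can be
sketched in Python. The sketch assumes that each operation is stored as a tuple whose
entry `a` is the (0-based) index of the attractor it maps attractor `a` to; the product
`x*y` means "apply `y` first, then `x`". The helper names and the three toy generators
are hypothetical and are not the operations of Tab. 1.

```python
def compose(x, y):
    """Product x*y of two operations: apply y first, then x."""
    return tuple(x[y[a]] for a in range(len(y)))


def generate_structure(generators):
    """All distinct finite products of the generators."""
    elements = set(generators)
    frontier = list(elements)
    while frontier:
        x = frontier.pop()
        for g in generators:
            product = compose(x, g)
            if product not in elements:
                elements.add(product)
                frontier.append(product)
    return elements


# Hypothetical rows of a mapping diagram with three attractors.
B1 = (0, 0, 0)      # projection onto attractor 0
B2 = (1, 1, 1)      # projection onto attractor 1
B3 = (0, 2, 0)      # depends on the previous attractor
structure = generate_structure([B1, B2, B3])

is_projection = compose(B1, B3) == B1            # B1*B3 = B1
commute = compose(B3, B2) == compose(B2, B3)     # False for this toy example
```

In this toy case the three generators produce one additional element, the product
`B3*B2`, so the structure closes with four elements; counting the elements of
`structure` is one way to compare the complexity of different mapping diagrams.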

2.3 Learning on Graphs 

In a very elementary way, the described graphs can be used to simulate simple 
supervised learning processes. This can be achieved by considering the inputs as 
stimuli to which the rest of the graph reacts in order to produce an optimal output. 
In order to define such an optimal output, two of the internal vertices (vertices 1
and 2 in Tab. 2) are defined as output vertices, on which particular field values as
given in Tab. 4 are defined as optimal. As we want to investigate how input from
the 16 external vertices is processed onto the two output vertices by the remaining
six internal vertices, direct connections from external vertices to output vertices are
excluded.

The learning process is intended to produce field states on the six internal vertices
which map the inputs $B_i$ on the external vertices onto output states as close as
possible to those given in Table 4. The internal structure of the graph is, thus,
optimized in such a way that its links (connectivities) and vertices (field values)
finally give rise to optimal output states.

The following measure of variance serves to quantify the distance of actual output
states $u(z_i)$ from optimal output states $u(z_i)_{\mathrm{opt}}$ ($i = 1, 2$):

$$v = \sum_{\text{ext. cond.}} \; \sum_{t=10}^{30} \; \sum_{\text{output states}} \bigl(u(z_i) - u(z_i)_{\mathrm{opt}}\bigr)^2. \tag{8}$$

The sum extends over all 11 external conditions, over 20 time steps (beginning after
the first 10 transient time steps), and over the two output vertices. A variance $v = 0$
implies optimal learning, i.e. an optimized structure of the six internal vertices has
been reached. For a random graph, $v$ is of the order of $1.5$ to $2 \times 10^4$. (Note that the
fitness of the graph is related to the inverse of its variance $v$.)
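
The variance of Eq. (8) can be written as a short Python function. This is a sketch
under stated assumptions: the data layout (`trajectories[c][t]` holds the pair of
output values at time step `t` under external condition `c`) and the function name
are hypothetical, and the default window $t = 11, \dots, 30$ is one reading of "20 time
steps beginning after the first 10 transient time steps".

```python
def variance(trajectories, targets, t_start=11, t_end=30):
    """Sum of squared deviations of the two output vertices from their optimal values."""
    v = 0
    for outputs, target in zip(trajectories, targets):
        for t in range(t_start, t_end + 1):
            v += sum((u - u_opt) ** 2 for u, u_opt in zip(outputs[t], target))
    return v


# Worst case for I_max = 10: both outputs are off by I_max at every counted time step.
worst = variance([[(10, 10)] * 31 for _ in range(11)], [(0, 0)] * 11)
perfect = variance([[(0, 10)] * 31 for _ in range(11)], [(0, 10)] * 11)
```

With 11 conditions, 20 time steps, two output vertices and $I_{\max} = 10$, the largest
value this window allows is $11 \cdot 20 \cdot 2 \cdot 10^2 = 44000$, which sets the scale
for the variances of random graphs.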

In order to find an optimized graph (an "optimal learner") with respect to a 
predefined input-output pattern, a random graph is used as an initial condition and 
randomly selected single-link changes (insertion or deletion of a directed link) are 
offered successively, implying changes of the state values on internal and output ver- 
tices according to Eqs. (6) and (7). (Note that this strategy differs from optimization 
based on changing the strength of links, e.g. by Hebb's rule.) The initial random 
graph contains only undirected links, and there are no connections of input-input, 
input-output, and output-output vertices. 

If the variance of a graph decreases after a link change, the change is accepted,
otherwise it is rejected. In this way, a sequence of graphs is generated with improving
output behavior. In many cases the sequence terminates with a variance much larger than
0 (between $10^2$ and $10^4$). In such cases the evolution of the graph ends in a local
minimum far away from optimal behavior. In other cases the sequence ends with
an optimal learner, $v = 0$.
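
The accept/reject rule for single-link changes amounts to a greedy local search, which
can be sketched in Python. Everything specific here is an assumption for illustration:
the function name `optimize_links`, the list `allowed` of links that may be toggled,
and in particular the toy cost function, which merely counts links that differ from a
hypothetical target graph and stands in for the variance $v$ obtained from a full
simulation.

```python
import random


def optimize_links(ad, allowed, cost, n_offers, rng):
    """Offer random single-link changes; keep a change only if the cost decreases."""
    best = cost(ad)
    history = [best]
    for _ in range(n_offers):
        if best == 0:
            break
        x, y = rng.choice(allowed)     # directed link from y to x
        ad[x][y] ^= 1                  # insert the link if absent, delete it if present
        trial = cost(ad)
        if trial < best:
            best = trial               # accept the change
            history.append(best)
        else:
            ad[x][y] ^= 1              # reject: restore the previous graph
    return history


# Toy stand-in for the variance: number of links differing from a hypothetical target.
target = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
toy_cost = lambda m: sum(m[x][y] != target[x][y] for x in range(3) for y in range(3))
allowed = [(x, y) for x in range(3) for y in range(3) if x != y]   # no self-loops
graph = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
history = optimize_links(graph, allowed, toy_cost, n_offers=1000, rng=random.Random(0))
```

The toy cost has no local minima, so the search always reaches zero; with the variance
of Eq. (8) as cost, the same loop can stall at a non-zero value, which corresponds to
the non-optimal learners described above.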

3 Non-Commutativity of Inputs and 

Non-Monotonic Complexity of Learning 

3.1 Output Dependence on the Sequence of Inputs 

The inputs $B_i$ for the learning process are always presented in the same sequence
from $i = 1$ up to $i = 11$. Each input is presented for 30 time steps, after which
the next input follows. Except for the random initialization of the fields on the
internal and output vertices at the beginning of the learning run, there is no
randomization when inputs are changed: the field values start with the attractor state
of the previous input.

Table 5: The mapping diagram for a perfect learner with 11 inputs $B_i$ and 11 attractors
$a_j$. The entries show the number of the attractor state which is obtained by applying $B_i$
(plotted vertically) to $a_j$ (plotted horizontally).

It turns out that the graphs not only learn to provide optimal outputs for individual
inputs, but they learn to do so for particular sequences of inputs. In most
cases, input $i + 1$ is recognized correctly (in the sense that the fields on the output
vertices assume the optimal values) only if the previous input $i$ was recognized
correctly and the starting configuration of the fields for input $i + 1$ corresponds to the
attractor for input $i$.

The multiplicative structure introduced above expresses how sensitively the reaction
of graphs to the presentation of an input depends on previous inputs. For
"perfect learners", optimally recognizing each input independently of the previous
configuration, the multiplicative structure of the inputs is quite trivial: for any
initial state, each input operation $B_i$ simply projects the system onto its corresponding
attractor. This gives rise to a mapping diagram as in Tab. 5.

The multiplicative structure associated with Tab. 5 consists of the 11 elements
$B_i$ which are idempotent,

$$B_i^2 = B_i \quad \text{for all } i, \tag{9}$$

and satisfy the relation

$$B_i B_j = B_i \quad \text{for all } i, j, \tag{10}$$

hence they are non-commutative, though associative:

$$B_i (B_j B_k) = (B_i B_j) B_k = B_i \quad \text{for all } i, j, k. \tag{11}$$
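
Relations (9) to (11) follow in one line from the projection property. Writing $a_i$
for the attractor onto which a perfect learner is projected by $B_i$ (a notation assumed
here for illustration), one has $B_i[a] = a_i$ for every state $a$, and therefore:

```latex
\begin{align*}
(B_i B_j)[a] &= B_i\bigl[B_j[a]\bigr] = B_i[a_j] = a_i = B_i[a], \\
(B_j B_i)[a] &= B_j\bigl[B_i[a]\bigr] = B_j[a_i] = a_j = B_j[a].
\end{align*}
```

Setting $j = i$ in the first line gives idempotency, and comparing the two lines shows
that $B_i$ and $B_j$ commute only if $a_i = a_j$.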

Since the optimal reaction of a graph to an input is not uniquely related to that
input, the attractor providing an optimal output can be identical for different inputs.
Therefore, the multiplicative structure of input operations can be even simpler in
the sense that some of the attractors are identical. Table 1 shows a corresponding
example with fewer than 11 attractors.

Deviations from Eq. (10) indicate a more complicated structure of learning operations.
If the elements in the same row (i.e. for the same input) of the mapping
diagram differ from each other, the reaction of the graph with respect to an input
depends on the previous input. This means that the result of a learning process
depends on the sequence in which successive learning steps are carried out. This
implies that the multiplicative structure of input operations deviates from Eq. (10).
Since the $B_i$ are mappings, associativity is valid trivially. However, the structure
will generally be non-commutative,

$$B_i B_j \neq B_j B_i, \tag{12}$$

although it may happen that particular inputs commute, for instance when they
project onto the same attractor, such as $B_2$ and $B_4$, or $B_3$ and $B_5$, or $B_6$ and $B_7$ in
Tab. 1.

We can now understand how an optimal learner differs from a perfect learner,
which recognizes inputs independently of the sequence of their presentation. Comparing
Tabs. 2 and 4 shows that attractor $a_1$ leads to the optimal output (field values
on the first two vertices) for input $B_1$, attractors $a_2$ and $a_3$ yield the optimal output
for inputs $B_2$ to $B_5$, and attractors $a_4$ and $a_5$ yield the optimal output for inputs
$B_6$ to $B_{11}$. In these cases, optimal learning coincides with perfect learning.

From Tab. 1 we see that inputs $B_1$ to $B_8$ are recognized independently of previous
inputs. By contrast, inputs $B_9$, $B_{10}$ and $B_{11}$ are recognized correctly only if the
previous input is $B_8$, $B_9$ and $B_{10}$, respectively. Table 2 shows that attractors $a_7$ to $a_{10}$
lead to an "almost" correct output for inputs $B_9$ to $B_{11}$, and the output of $a_6$ differs
considerably from any optimal output. Although these situations represent optimal
learning, they are different or even far from perfect learning.

If the attractor for a particular input does not consist of one single state (fixed
point), but of a periodic sequence of states (limit cycle), idempotency (9) no
longer holds. (Strictly speaking, this is only correct if the number of time steps $t$ in
the mapping $B_i = \Phi_{t,b_i}$ and the length of the cycle have no common divisor.
Otherwise, the attractor may consist of more states than can be detected by the
mapping diagram or the set of inputs $B_i$.)
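
The role of the cycle length can be made concrete with a small Python sketch. It
assumes that, restricted to a limit cycle of length $m$, the mapping $B_i$ simply
advances the system by $t$ positions along the cycle; the function name
`states_reached` is hypothetical, and $t = 30$ is the presentation time per input
quoted in Sec. 3.1.

```python
from math import gcd


def states_reached(cycle_length, t):
    """Number of distinct cycle states visited when a shift by t steps is iterated."""
    position, visited = 0, set()
    while position not in visited:
        visited.add(position)
        position = (position + t) % cycle_length
    return len(visited)


# Cycle lengths 3, 4 and 7 with t = 30 time steps per input.
reached = {m: states_reached(m, 30) for m in (3, 4, 7)}
```

Under this assumption the number of states reached equals $m / \gcd(m, t)$: all states
of the cycle are detected when $m$ and $t$ share no common divisor, only some of them
otherwise, and the mapping acts as the identity on the cycle when $m$ divides $t$.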

Note that the structure of learning operations derived here is more general than 
an algebra (as conjectured by Gernert 2000). There is no identity element, there is 
no neutral element, and no addition of the elements Bi is defined. 

3.2 Number of Attractors Versus Variance 

In order to investigate the evolution of the set of attractors during the learning 
process, we focus on the number N of attractor states as a function of learning steps 
for the entire sequence of graphs starting from a random graph until a graph with 
optimal learning is reached. Since a large number of attractors intuitively relates 
to quite complex structures of the graph during the learning process, we propose to
refer to the size of the set of attractors as a possible measure for the complexity of 
learning. However, it should be emphasized that a rigorous definition of complexity 
(cf. Wackerbauer et al. 1994) is not yet associated with this notion. 

Initially, the graphs are (almost) random and exhibit large variances of the order
of $2 \times 10^4$. For these graphs the number of attractor states with respect to the inputs
varies over a large range; typical are numbers between 30 and 50. As learning begins,
the variance decreases, but the number of attractor states increases, sometimes up
to a few hundred. A further decrease in variance, below a value of 6000, causes
the number of attractor states to decrease again. For optimal learners (graphs with
vanishing variance) the number of attractor states terminates at around $N = 10$. A
typical example is shown in Fig. 3.

Figure 3: Number $N$ of attractor states (vertical axis) as a function of variance $v$
(horizontal axis), starting from a random graph with $v \approx 18000$ and terminating at a graph
with optimal response and $v = 0$. The points refer to those graphs which were accepted
during the learning process, so that decreasing variance indicates progressive learning.
The non-monotonic behavior of the complexity of learning is clearly visible.

We now select a sample of 116 learning sequences starting from random graphs 
and terminating as (almost) optimal learners. For this sample we count the number 
of attractor states, i.e. the complexity of learning, for those graphs which were 
accepted during the process, i.e., for which the variance was always smaller than for 
any previous graph in the sequence. Their behavior can be seen in Fig. 4, where $N$
is plotted as a function of $v$. It confirms the impression from Fig. 3 that, as learning
proceeds, its complexity evolves non-monotonically.

In about 50% of the cases the sequence started with fewer than $N = 50$ attractor
states. The final $N$ was much smaller, and for intermediate stages of learning $N$
reached a maximum during the learning process. In about 85% of all cases the final
number of attractor configurations was smaller than 20. The largest final number
of attractor states for an optimal learner was 56.

Figure 4: Number $N$ of attractor states (vertical axis) as a function of variance $v$
(horizontal axis) for 116 learning sequences starting from random graphs and terminating as
(almost) optimal learners with a variance of below 10. The plot shows only those graphs
which were accepted during learning. The non-monotonic behavior of the complexity of
learning for optimal learners is clearly visible.

Figure 5: Number $N$ of attractor states (vertical axis) as a function of variance $v$
(horizontal axis) for 98 learning sequences starting from random graphs and terminating as
non-optimal learners with a variance of above 9700. The plot shows only those graphs
which were accepted during learning. The non-monotonic behavior of the complexity of
learning is visible for non-optimal learners as well.

Exceptions from this behavior occur if the initial (random) graph has a number 
of attractor states that is extremely large, exceeding any other number of attractor 
states in the sequence. For this case we find a total number of 15 sequences. In 12 of 
these sequences the initial number of attractors is larger than 100 (with a maximum 
of 747). 

Figure 5 shows a plot of number of attractors as a function of variance for 98
non-optimal learners whose final variance is $v > 9700$. Keeping in mind that decreasing
variance corresponds to progressive learning, the general trend of Figs. 3 and 4
reappears: the size of the set of attractors, i.e. the complexity of learning, evolves 
non-monotonically as learning proceeds. 

As the main observation of the present subsection, we can state that the num- 
ber N of attractors required to optimally map a given input onto a predetermined 
output evolves non-monotonically during the process of learning. While N increases 
during the initial phase of learning, it decreases again until the learning process 
is terminated. We interpret this behavior as a non-monotonic complexity of the 
learning process. 

4 Is the Complexity of Learning 
Related to Meaning? 

Non-monotonic as opposed to monotonic measures of complexity have been de- 
veloped and investigated for about two decades; for a comparative overview see 
Wackerbauer et al. (1994). The property of monotonicity is usually understood as 
a function of (some measure of) randomness of the pattern or process considered. 
Monotonic complexity essentially increases as randomness increases: most random 
features are also most complex. Non-monotonic complexity shows convex behavior 
as a function of increasing randomness: highest complexity is assigned to features 
with a mixture of random and non-random elements, while both very low and very 
high randomness yield minimal complexity. 

There is an interesting relationship between the two classes of complexity mea- 
sures and measures of information; for more details see Atmanspacher (1994) or 
Atmanspacher (2005). It turns out that monotonic complexity usually corresponds 
to syntactic information, whereas non-monotonic (convex) complexity corresponds 
to semantic information or other measures of meaning (see Fig. 6).

As a particularly interesting approach, pragmatic information has been proposed
(Weizsäcker 1972) as an operationalized measure of meaning. Its essence is
that purely random messages keep providing complete novelty (or primordiality)
as they are delivered, while purely non-random messages keep providing complete
confirmation (after initial transients). Pragmatic information refers to meaning in
terms of a mixture of confirmation and novelty. Extracting meaning from a message
depends on the capability to transform novel elements into knowledge using
confirming elements.

Figure 6: Schematic illustration of two different classes of complexity measures,
corresponding to different information measures and distinguished by their functional
dependence on randomness. (Panels: monotonic complexity and syntactic information as a
function of randomness; convex complexity and pragmatic information as a function of
randomness.)

It has been speculated (Atmanspacher 1994) that systems having this capacity 
are able to reorganize themselves in order to flexibly modify their complexity relative 
to the task that they are supposed to solve. A learning process, in which insight is 
gained and meaning is understood, may start at low complexity (high randomness, 
much novelty) and terminate at low complexity (high regularity, much confirmation), 
but it passes through an intermediate stage of maximal complexity. 

The notion of pragmatic information was earlier utilized in this sense for non- 
equilibrium phase transitions in multimode lasers (Atmanspacher and Scheingraber 
1990). It could be shown that a particular well-defined type of pragmatic infor- 
mation, adapted to that case, behaves precisely as indicated above. Pragmatic 
information is maximal at the unstable stage of the phase transition, and it is low 
in the preceding and successive stages. However, lasers are physical systems, and 
it is problematic to ascribe something like an "understanding of meaning" to their 
behavior. 
Biological networks such as those studied in this paper are more realistic systems for a
concrete demonstration of the basic idea. The non-monotonic complexity of learning 
processes as indicated in Sec. 3.2 starts with random graphs and ends with graphs 
of minimized variance (maximized fitness), which are as non-random as possible 
under the given conditions. In this sense, a scenario has been established in which 
the complexity of learning on graphs qualitatively satisfies the conditions required 
for relating it to a measure of pragmatic information. Within this scenario, our 
approach suggests that the actual "release of meaning" during learning does not 
occur when the output is optimized but rather when the complexity is maximized. 

It is a long-standing desideratum to identify meaning-related physiological fea- 
tures in the brain (Freeman 2003). Since learning is a key paradigm in which the 
emergence of meaning can be studied, we hope that our approach may offer a useful 
perspective for progress concerning this problem. 

5 Summary 

In this contribution an example of supervised learning in recurrent networks of small 
size implemented on graphs is studied numerically. The elements of the network are 
treated as vertices of graphs and the connections among the elements are treated 
as links of graphs. Eleven inputs and two outputs are predefined, and the learning 
process within the remaining six internal vertices is carried out such as to minimize 
the difference between the actual output and the predetermined output. Optimiza- 
tion of outputs is achieved by stable configurations at the internal vertices that can 
be characterized as attractors. 

Two particular features of the learning behavior of the network are investigated 
in detail. First, it is shown that, in general, the mapping from inputs to outputs 
depends on the sequence of inputs. Thus, the associative multiplicative structure of 
input operations represented by sets of attractors is, in general, non-commutative. 
Second, the size of the set of attractors changes as the learning process evolves. 
With increasing optimization (fitness), the number of attractors increases up to a 
maximum and then decreases down to a usually small final set for optimal network 
performance. 

Assuming that the size of the set of attractors indicates the complexity of learn- 
ing, its non-monotonic behavior is of special interest. Since non-monotonic measures 
of complexity can be related to pragmatic information as a measure of meaning, it is 
tempting to consider the maximum of complexity as reflecting the release of meaning 
in learning processes. Further work will be necessary to substantiate this specula- 
tion. 